KoGra-DB: Using MapReduce for Language Corpora

نویسنده

  • Roman Schneider
چکیده

Linguistic query systems are special purpose IR applications. We present a novel state-of-the-art approach for the efficient exploitation of very large linguistic corpora, combining the advantages of relational database management systems (RDBMS) with the functional MapReduce programming model. Our implementation uses the German DEREKO reference corpus with multi-layer linguistic annotations and several types of text-specific metadata, but the proposed strategy is language-independent and adaptable to large-scale multilingual corpora.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparing Distributed Indexing: To MapReduce or Not?

Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the auspices of the MapReduce programming framework. In particular, we describe two indexing approaches based on the original MapReduce paper, and comp...

متن کامل

A Common Compiler Framework for Big Data Languages: Motivation, Opportunities, and Benefits

We are in the era of Big Data and cluster computing. Data sizes have been growing at an exponential rate. At the same time, growth in computing power has been stagnating due to physical limits in processor technology. The only cost effective way to keep up with the growing data trend has been to harness multiple commodity computers in a shared-nothing configuration. Google, needing to manage ex...

متن کامل

Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce

This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in industrial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. We discuss two challenges contributing to this problem: lack of a suitable programming model for managing concurr...

متن کامل

BrackitMR: Flexible XQuery Processing in MapReduce

We present BrackitMR, a framework that executes XQuery programs over distributed data using MapReduce. The main goal is to provide flexible MapReduce-based data processing with minimal performance penalties. Based on the Brackit query engine, a generic query compilation and optimization infrastructure, our system allows for a transparent integration of multiple data sources, such as XML, JSON, ...

متن کامل

Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs

MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical succes...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013